[KV Connector] Canonical KV Cache Allocation for HMA Models#37885
Etelis wants to merge 29 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed aa5d414 to 9697d1c
Code Review
This pull request introduces canonical KV cache allocation for Hybrid Multi-Attention (HMA) models, specifically targeting those with uniform page sizes. This is a significant improvement as it enables contiguous cross-layer block allocation, which was previously limited to single-group models. The changes involve new data structures to represent canonical KV caches and their references, along with modifications to the KV cache allocation logic within the gpu_model_runner and kv_connector_model_runner_mixin. A comprehensive unit test suite has been added to validate the new allocation strategy under various conditions, including happy paths and rejection cases. The implementation appears well-considered and robust, addressing the stated goal of improving RDMA transfer efficiency by ensuring memory contiguity for HMA models.
Hi @Etelis, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
# The connector must support HMA
if not supports_hma(get_kv_transfer_group()):
    return False
if len(kv_cache_config.kv_cache_groups) <= 1:
    return False

Suggested change:
- if len(kv_cache_config.kv_cache_groups) <= 1:
+ if len(kv_cache_config.kv_cache_groups) < 1:
# All groups must use AttentionSpec with uniform page size

Suggested change:
- # All groups must use AttentionSpec with uniform page size
+ # Currently, all groups must use AttentionSpec with uniform page size
+ # We plan to gradually relax this requirement to support other cases
spec = kv_cache_config.kv_cache_groups[0].kv_cache_spec
assert isinstance(spec, AttentionSpec)

Can we remove this and use the spec inside the loop for each group?
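A minimal sketch of what the reviewer is suggesting: rather than reading the spec from group 0 and asserting its type once, validate every group's spec inside the loop. The names (`AttentionSpec`, `kv_cache_spec`) mirror the PR, but the classes and the helper function here are simplified stand-ins, not the actual vLLM types.

```python
from dataclasses import dataclass


@dataclass
class AttentionSpec:
    page_size_bytes: int


@dataclass
class KVCacheGroup:
    kv_cache_spec: object


def all_groups_uniform(groups: list[KVCacheGroup]) -> bool:
    page_sizes = set()
    for group in groups:
        spec = group.kv_cache_spec
        # Reject non-attention specs for every group, not just group 0.
        if not isinstance(spec, AttentionSpec):
            return False
        page_sizes.add(spec.page_size_bytes)
    # All groups must share a single page size.
    return len(page_sizes) == 1


groups = [KVCacheGroup(AttentionSpec(4096)), KVCacheGroup(AttentionSpec(4096))]
print(all_groups_uniform(groups))
```

This removes the single-group assert entirely: a non-attention spec in any group simply disables the canonical path instead of crashing.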
)
self.cross_layers_kv_cache = cross_layers_kv_cache
self.cross_layers_attn_backend = attn_backend
elif self.use_canonical_kv_caches(

Let's move this check before checking use_uniform_kv_cache.
kernel_num_blocks = num_blocks * num_blocks_per_kv_block

# prepend a group_size dimension into the shape
kv_cache_shape = attn_backend.get_kv_cache_shape(

Can we move this logic AFTER we allocate the single tensor? Then, inside the layer loop, we can reshape? I think we can also remove assert len(unique_kernel_bs) == 1. I think it's better to also build the group_data_refs inside the same loop.
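The allocate-then-reshape idea above can be sketched as follows: allocate one flat buffer up front, then take a per-layer strided view inside the layer loop. numpy stands in for torch here, and the shapes are toy values, not the PR's actual layout.

```python
import numpy as np

num_blocks, group_size, page_elems = 4, 2, 8

# Single contiguous allocation covering all layers in the group.
flat = np.zeros(num_blocks * group_size * page_elems, dtype=np.float32)

layer_views = []
for layer_idx in range(group_size):
    # Each layer gets a strided view into the same buffer -- no copy.
    view = flat.reshape(num_blocks, group_size, page_elems)[:, layer_idx, :]
    layer_views.append(view)

# Writing through a view mutates the shared buffer, showing that all
# layers of a block live inside one contiguous allocation.
layer_views[1][0, :] = 1.0
print(flat[page_elems:2 * page_elems])
```

Because reshaping happens per layer after the single allocation, there is no need for a uniform-kernel-block-size assert before allocating, which matches the reviewer's point.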
@property
def needs_kv_cache_zeroing(self) -> bool:
    return self.has_mamba_layers

These classes are currently specific to connector usage. I think we should move them to base.py.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…nector Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed db277a8 to 5b2b3bc
WorkerConnectorInitializationData,
)

kv_transfer_group.initialize_worker_connector(

Actually, initialize_worker_connector is needed for the CacheBlend use-case. Let's try to call it exactly as in #37339. But keep this if here and simply pass, commenting that the canonical KV caches will be registered below.
I thought they'd add it themselves afterwards; nvm, I will fix it.
Combine the kv_caches population, block tensor splitting, and layer-to-position mapping into a single pass over positions. Remove the unique kernel block size assertion. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Call initialize_worker_connector unconditionally so connectors like CacheBlend can use it regardless of the allocation path taken. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed 30cca9f to 432d002
canonical_kv_caches is the CanonicalKVCaches wrapper for the connector.
"""
# all tensors have the same size (validated by use_canonical_kv_caches)

Where did we validate this?
Move the uniform tensor size check into use_canonical_kv_caches so the precondition is validated before entering the allocation path, keeping the assert in allocate_canonical_kv_caches as a safety net. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
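The commit's guard-plus-safety-net approach can be illustrated with a toy sketch. Both function names mirror the PR, but the bodies are hypothetical simplifications that only model the uniform-size precondition:

```python
def use_canonical_kv_caches(tensor_sizes: list[int]) -> bool:
    # Guard: reject configs whose KV cache tensors differ in size,
    # so the allocation path never sees a non-uniform config.
    return len(set(tensor_sizes)) == 1


def allocate_canonical_kv_caches(tensor_sizes: list[int]) -> int:
    # Safety net: the guard above should already have filtered this.
    assert len(set(tensor_sizes)) == 1, "tensor sizes must be uniform"
    return tensor_sizes[0]


sizes = [1 << 20] * 3
if use_canonical_kv_caches(sizes):
    print(allocate_canonical_kv_caches(sizes))
```

The design point is that the boolean guard decides the allocation path, while the assert only documents (and enforces in debug) an invariant the guard already established.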
orozery
left a comment
Thanks @Etelis !
Can you please test this PR on top of this branch?
https://github.com/orozery/vllm/tree/kv-offload-hma
Specifically, verify test_cpu_offloading.py passes, and whether we see performance gains.
[] for _ in kv_cache_config.kv_cache_groups
]

kernel_block_size = kernel_block_sizes[0]

Can we initialize kernel_block_size = kernel_block_sizes[gid] inside the loop?

Done, I didn't consider that backends could have different kernel block sizes.
block_tensor = typed_buffer.select(group_dim, i)
tensor_idx = len(block_tensors)
page_bytes = block_tensor[0].numel() * block_tensor.element_size()
block_tensors.append(

Aren't we expecting a single cross-layers tensor? With shape (num_blocks, page_size) and dtype int8?

Yeah, that's dumb. Fixed.
Replace per-position KVCacheBlockTensor objects with a single (num_blocks, cross_layer_page_size) int8 tensor. This avoids recomputing block tensors per position and matches the pattern used by the offloading connector's register_cross_layers_kv_cache. Also use per-group kernel_block_sizes[gid] inside the loop instead of hardcoded kernel_block_sizes[0]. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
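The fixed representation described in the commit can be sketched like this: one byte-typed view of the whole cross-layers buffer, shaped (num_blocks, cross_layer_page_size), instead of one block tensor per position. numpy stands in for torch, and all sizes are illustrative.

```python
import numpy as np

num_blocks, group_size, page_bytes = 4, 3, 16

# The single cross-layers allocation, here modeled as a raw byte buffer.
raw = np.zeros(num_blocks * group_size * page_bytes, dtype=np.uint8)

# One int8 view covering every layer's page within each block.
cross_layer_page_size = group_size * page_bytes
blocks = raw.view(np.int8).reshape(num_blocks, cross_layer_page_size)
print(blocks.shape)  # (4, 48)
```

Since `blocks` is a view, no per-position tensors are materialized; a connector can address block i as `blocks[i]` directly.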
Running on your branch I hit some issues with the connector not implementing the ... Ran on top of that branch with an A100.
Running other models as well, so I'll update soon.
for layer_name in kv_cache_tensor.shared_by:
    layer_gid = layer_to_group_idx[layer_name]
    group_data_refs[layer_gid].append(
        KVCacheBlockDataRef(
            tensor_idx=0,
            page_size_bytes=page_size,
        )
    )

We should have a single data reference per group.
Drop the duplicate KVCacheBlockTensor / KVCacheBlockDataRef / CanonicalKVCaches dataclasses from kv_connector/v1/base.py and import the existing types from vllm.v1.kv_offload.spec. Emit a single data reference per group instead of one per layer. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…ge size Restore KVCacheBlockTensor / KVCacheBlockDataRef / CanonicalKVCaches in kv_connector/v1/base.py (these types are connector-owned) and fix KVCacheBlockDataRef.page_size_bytes to cover all layers in the group (page_size * group_size) now that we emit a single ref per group. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
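The one-ref-per-group fix described in the commit can be sketched as below. The `KVCacheBlockDataRef` fields mirror the PR; the builder function is a hypothetical simplification showing the page_size * group_size accounting:

```python
from dataclasses import dataclass


@dataclass
class KVCacheBlockDataRef:
    tensor_idx: int
    page_size_bytes: int


def build_group_data_refs(page_size: int, group_sizes: list[int]):
    # One ref per group, not one per layer: page_size_bytes covers
    # every layer in the group (page_size * group_size).
    return [
        [KVCacheBlockDataRef(tensor_idx=0, page_size_bytes=page_size * gs)]
        for gs in group_sizes
    ]


refs = build_group_data_refs(page_size=4096, group_sizes=[8, 4])
print([r[0].page_size_bytes for r in refs])  # [32768, 16384]
```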
- use_canonical_kv_caches: < 2 -> < 1 to allow single-group HMA models - allocate_canonical_kv_caches: read num_blocks from config and assert - allocate_canonical_kv_caches: drop unreachable try/except around get_kv_cache_stride_order (guard already validates it succeeds) Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
orozery
left a comment
LGTM. Thanks @Etelis !
@NickLucche @heheda12345 WDYT?
This PR extends the cross-layers layout to models with multiple groups that are all attention (e.g. gpt-oss).
More importantly, it defines a generic API (CanonicalKVCaches) for describing the KV caches (either cross-layers or not) to the connector.
It is meant to replace register_cross_layers_kv_cache, which is kept for now for backward compatibility.
This API could support (without any extension) models using mamba or hybrid mamba/attention. (We plan that to be a follow-up to this PR.)
Also, this API can later be extended to include striding information for connectors doing hetero-TP transfers.
@Etelis Now that the offloading connector supports HMA, we should add its
…HMA Models Applied squashed diff from vllm-project#37885 Original author: Itay Etelis <itay.etelis@ibm.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Can review tomorrow, apologies for the delay.

Tested performance with gpt-oss-20b on H100:
NickLucche
left a comment
I guess one main question: why aren't we folding this into the cross-layer functionality as a natural extension to HMA?
It feels like things are tied in the current formulation (canonical implies cross-layer), and yet there are separate branches in shared code paths like the runner.
class CanonicalKVCaches:
    """
    Canonicalized block-level representation of the KV caches.

    Composed of:
    - Unique list of KV cache data tensors,
      each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
    - Per-group data references of the tensors,
      i.e. how each KV cache group maps to the tensors.
    """

    # Ordered list of unique block tensors, each with shape
    # (num_blocks, ...).
    tensors: list[KVCacheBlockTensor]
    # Per-KV-cache-group list of data references that map each layer
    # in the group to the appropriate entry in the tensors list.
    group_data_refs: list[list[KVCacheBlockDataRef]]
I am not sure these dataclasses about tensors belong here with the kv_connector interface. They look a lot more related to what's in kv_cache_manager.py.
I'd rather keep this file lean for the actual interface.
Connectors need a way to know how to access the KV cache tensors.
Currently, connectors have 2 tasks:
- Determine the topology for each KV cache tensor
- Determine how each group maps to each KV cache tensor (using KVCacheConfig)
Using the canonical KV caches saves connectors these 2 tasks:
- All tensors are (num_blocks, ...) first
- group_data_refs describes how each group maps to tensors.
With the cross-layers layout you cannot use KVCacheConfig, as the tensors (a single one) do not match kv_cache_config.kv_cache_tensors.
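The two tasks the canonical API removes can be made concrete with a small sketch. The dataclass fields mirror the PR's CanonicalKVCaches; the lookup helper and the toy tensor are illustrative only:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KVCacheBlockDataRef:
    tensor_idx: int
    page_size_bytes: int


@dataclass
class CanonicalKVCaches:
    tensors: list          # each entry: (num_blocks, ...) with int8 dtype
    group_data_refs: list  # per-group refs into `tensors`


def block_bytes(caches: CanonicalKVCaches, group_id: int, block_id: int):
    # The connector never inspects KVCacheConfig: the group's ref says
    # which tensor backs it and how many bytes one block occupies.
    ref = caches.group_data_refs[group_id][0]
    tensor = caches.tensors[ref.tensor_idx]
    return tensor[block_id, : ref.page_size_bytes]


tensor = np.arange(4 * 8, dtype=np.int8).reshape(4, 8)
caches = CanonicalKVCaches(
    tensors=[tensor],
    group_data_refs=[[KVCacheBlockDataRef(tensor_idx=0, page_size_bytes=8)]],
)
print(block_bytes(caches, group_id=0, block_id=2))
```

Because every tensor is (num_blocks, ...)-first, this one lookup works identically for cross-layers and per-layer layouts.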
if len(page_sizes) != 1:
    return False

Isn't this case unexpected if we take UniformTypeKVCacheSpecs out of the equation? We should probably assert or at least log.

This is the use_canonical_kv_caches function. The purpose of this function is to determine whether we should allocate cross-layers or use the regular allocation. If the spec is UniformTypeKVCacheSpecs, we should return False to disable cross-layers allocation (for now).
# num_blocks must be the leading physical dimension.
# +1 accounts for the prepended group_size dimension.
if stride_order[0] != kv_cache_shape.index(1234) + 1:
    return False

Can we unify to use get_kv_cache_block_dim to get the block dim?

We already get the kv_cache_shape to check if cross-layers is supported, so it seems easier and more efficient to leave this check here.
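The check quoted above relies on a dummy-value trick: ask the backend for a KV cache shape using a distinctive num_blocks value (1234), then locate that value in the returned shape to find the block dimension. A minimal illustration, where the shape function is a made-up stand-in for a backend's get_kv_cache_shape:

```python
def get_kv_cache_shape(num_blocks, block_size, num_heads, head_size):
    # A FlashAttention-style layout is assumed here for illustration:
    # (2, num_blocks, block_size, num_heads, head_size)
    return (2, num_blocks, block_size, num_heads, head_size)


# Probe with a distinctive value, then search for it in the shape.
kv_cache_shape = get_kv_cache_shape(1234, 16, 8, 64)
block_dim = kv_cache_shape.index(1234)
print(block_dim)  # 1: num_blocks is the second physical dimension
```

The trick only works if no other dimension can equal the probe value, which is why a distinctive number like 1234 is used rather than a small one.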
if len(stride_order) != len(kv_cache_shape) + 1:
    return False

I feel like some of these invariants could be asserted by use_uniform_kv_cache per group?

use_uniform_kv_cache should actually be deprecated, along with register_cross_layers_kv_cache. I suggest that we remove it. Connectors using prefers_cross_layers_block will simply fall back to the regular register_kv_caches. What do you think?
This is the first phase of a multi-phase effort to enable contiguous KV cache allocation for all model architectures. Currently, only single-group (uniform) models benefit from contiguous cross-layer blocks. This PR extends that to HMA models with uniform page sizes. Future phases will broaden support to models with varying page sizes and additional architectures.
The existing allocate_uniform_kv_caches path only supports single-group models (all layers identical). HMA models like Gemma 3 have multiple KV cache groups (full attention + sliding window) with different eviction policies but the same page size. Previously, these models fell back to per-layer allocation, which scatters block data across non-contiguous memory regions, making RDMA transfers inefficient. This PR extends contiguous KV cache allocation to HMA models where all KV cache groups share the same page size.
Test plan
- pytest tests/v1/kv_connector/unit/test_canonical_kv_caches.py -v -s (exercises use_canonical_kv_caches)
Related PRs
- KVCacheTopology PR (closed, too complex).
- WorkerConnectorInitializationData pattern. We adopt their interface design; hopefully to be merged after that PR.